Download the nutrition data from GitHub. It contains several thousand food items with standard nutritional measurements (calories, fat, vitamins, etc.) along with a food group designation. The data originate from the USDA, and all items are measured per 100g, making observations inherently comparable.
load("~/Downloads/nutrition.rdata")
dim(nutrition)
## [1] 8463 28
There are some missing values in this data, but what happens if we simply remove all cases containing NAs…
dim(na.omit(nutrition))
## [1] 2184 28
That seems a bit extreme. Maybe we should check which columns contain a lot of NAs…
colSums(is.na(nutrition))
## food calories protein carbohydrates total_fat
## 0 3802 0 1 0
## saturated_fat caffiene cholesterol fiber folic_acid
## 340 3485 362 660 1817
## sodium calcium iron magnesium manganese
## 83 355 146 726 2109
## niacin phosphorus potassium riboflavin selenium
## 722 616 457 698 1757
## thiamin vitamin_A vitamin_B6 vitamin_B12 vitamin_C
## 720 684 984 1233 778
## vitamin_D zinc group
## 3332 744 0
Unfortunately, the NAs are spread across several measures that we probably don’t want removed (calories, for example). If we were very focused on this application, we might dig a bit further into why there are so many NAs, and whether some should, say, be treated as 0s instead.
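For this lab we’ll simply drop the NAs, but if the missing values in a column like caffiene really meant “none recorded”, a targeted replacement would keep far more rows than na.omit(). A minimal sketch on a made-up toy data frame (the values are invented; only the idea carries over):

```r
# toy stand-in for the nutrition data (values made up)
toy <- data.frame(calories = c(52, NA, 110),
                  caffiene = c(NA, 4, NA))

nrow(na.omit(toy))   # dropping every NA row keeps only 1 of 3 rows

# treat missing caffiene as "none recorded", i.e. 0, but leave calories alone
toy$caffiene[is.na(toy$caffiene)] <- 0

nrow(na.omit(toy))   # now 2 of 3 rows survive
```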
nutrition <- na.omit(nutrition)
Always a good idea to explore the data a bit further…
summary(nutrition)
## food calories protein carbohydrates
## Length:2184 Min. : 0.0 Min. : 0.00 Min. : 0.0000
## Class :character 1st Qu.: 65.0 1st Qu.: 1.74 1st Qu.: 0.3375
## Mode :character Median :171.0 Median : 7.49 Median : 8.5250
## Mean :201.1 Mean :11.48 Mean :20.3643
## 3rd Qu.:308.0 3rd Qu.:20.70 3rd Qu.:27.2475
## Max. :876.0 Max. :88.32 Max. :99.7700
## total_fat saturated_fat caffiene cholesterol
## Min. : 0.000 Min. : 0.00000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.340 1st Qu.: 0.06075 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 3.060 Median : 0.78800 Median : 0.000 Median : 0.00
## Mean : 8.713 Mean : 2.90595 Mean : 6.731 Mean : 39.66
## 3rd Qu.:12.383 3rd Qu.: 3.61650 3rd Qu.: 0.000 3rd Qu.: 67.00
## Max. :99.480 Max. :61.92400 Max. :5714.000 Max. :3100.00
## fiber folic_acid sodium calcium
## Min. : 0.000 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 15.0 1st Qu.: 11.00
## Median : 0.800 Median : 0.000 Median : 71.0 Median : 22.00
## Mean : 2.505 Mean : 6.636 Mean : 320.4 Mean : 88.09
## 3rd Qu.: 2.800 3rd Qu.: 0.000 3rd Qu.: 325.0 3rd Qu.: 61.00
## Max. :79.000 Max. :273.000 Max. :26000.0 Max. :7364.00
## iron magnesium manganese niacin
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.440 1st Qu.: 12.00 1st Qu.: 0.0180 1st Qu.: 0.400
## Median : 1.020 Median : 21.00 Median : 0.1080 Median : 1.432
## Mean : 2.178 Mean : 40.61 Mean : 0.6526 Mean : 2.978
## 3rd Qu.: 2.320 3rd Qu.: 32.00 3rd Qu.: 0.3723 3rd Qu.: 4.769
## Max. :123.600 Max. :781.00 Max. :133.0000 Max. :80.000
## phosphorus potassium riboflavin selenium
## Min. : 0.0 Min. : 0.0 Min. :0.0000 Min. : 0.00
## 1st Qu.: 37.0 1st Qu.: 135.0 1st Qu.:0.0470 1st Qu.: 0.70
## Median : 125.0 Median : 233.5 Median :0.1385 Median : 6.30
## Mean : 175.8 Mean : 342.7 Mean :0.2126 Mean : 15.22
## 3rd Qu.: 224.0 3rd Qu.: 356.0 3rd Qu.:0.2500 3rd Qu.: 24.00
## Max. :9918.0 Max. :16500.0 Max. :6.8000 Max. :1917.00
## thiamin vitamin_A vitamin_B6 vitamin_B12
## Min. : 0.0000 Min. : 0.0 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.0320 1st Qu.: 0.0 1st Qu.:0.0490 1st Qu.: 0.0000
## Median : 0.0700 Median : 22.0 Median :0.1290 Median : 0.0000
## Mean : 0.1855 Mean : 847.5 Mean :0.2395 Mean : 0.8791
## 3rd Qu.: 0.2000 3rd Qu.: 214.0 3rd Qu.:0.3400 3rd Qu.: 0.7225
## Max. :10.9900 Max. :77261.0 Max. :8.0000 Max. :83.1300
## vitamin_C vitamin_D zinc group
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Length:2184
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.270 Class :character
## Median : 0.15 Median : 0.00 Median : 0.770 Mode :character
## Mean : 10.45 Mean : 19.54 Mean : 1.721
## 3rd Qu.: 4.10 3rd Qu.: 5.00 3rd Qu.: 2.410
## Max. :2400.00 Max. :1123.00 Max. :90.950
This shows us that even though food amounts were standardized, the variables are clearly measured on vastly different scales. As such, we should be sure to scale before performing any PCA.
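To see concretely what scaling buys us, compare the proportion of variance captured by PC1 with and without scaling. A quick sketch using mtcars as a stand-in (its columns also sit on wildly different scales):

```r
# mtcars as a stand-in: its columns also span very different scales
p_raw <- prcomp(mtcars)                 # covariance-based (no scaling)
p_std <- prcomp(mtcars, scale. = TRUE)  # correlation-based (scaled)

prop1 <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
prop1(p_raw)  # PC1 is dominated by the largest-variance column (disp)
prop1(p_std)  # a far more balanced decomposition
```

The unscaled PCA essentially just finds whichever column has the largest raw variance, which is rarely what we want.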
nupca <- prcomp(nutrition[,-c(1,28)], scale.=TRUE)
summary(nupca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1627 1.6632 1.56431 1.32432 1.27038 1.17873 1.14213
## Proportion of Variance 0.1799 0.1064 0.09412 0.06745 0.06207 0.05344 0.05017
## Cumulative Proportion 0.1799 0.2863 0.38041 0.44786 0.50993 0.56337 0.61354
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.07919 0.98092 0.95248 0.93091 0.86575 0.83505 0.81022
## Proportion of Variance 0.04479 0.03701 0.03489 0.03333 0.02883 0.02682 0.02525
## Cumulative Proportion 0.65834 0.69535 0.73024 0.76357 0.79240 0.81922 0.84446
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.77617 0.76276 0.71123 0.6511 0.62883 0.60465 0.5679
## Proportion of Variance 0.02317 0.02238 0.01946 0.0163 0.01521 0.01406 0.0124
## Cumulative Proportion 0.86764 0.89001 0.90947 0.9258 0.94098 0.95504 0.9675
## PC22 PC23 PC24 PC25 PC26
## Standard deviation 0.55599 0.46560 0.41915 0.37748 0.04831
## Proportion of Variance 0.01189 0.00834 0.00676 0.00548 0.00009
## Cumulative Proportion 0.97934 0.98767 0.99443 0.99991 1.00000
Using the Kaiser criterion would suggest retaining 8 components, but that would only retain approximately 66% of the variability in the data. If we wanted to retain 90% of the variation, then we would need to keep 17 components. We can also check a scree plot
plot(nupca, type="lines")
but by default the x-axis isn’t long enough, so let’s try expanding it
plot(nupca, type="lines", xlim=c(1,26))
hmm, that didn’t work…a little searching shows that screeplot is the default plot method for a prcomp object, and the argument we actually want to increase is npcs…
plot(nupca, type="lines", npcs=26)
After all that work, there’s no clear picture of how many components to retain using a scree plot. Best argument might be for 3 components, but you’d be leaving a lot of variance on the table.
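Incidentally, both retention rules can be computed directly from the sdev slot of the prcomp object: with scaled data, sdev^2 gives the eigenvalues of the correlation matrix, so the Kaiser criterion keeps components whose sdev exceeds 1. A quick sketch on USArrests (a built-in data set, since nutrition isn’t bundled with R):

```r
p <- prcomp(USArrests, scale. = TRUE)
eig <- p$sdev^2   # eigenvalues of the correlation matrix

sum(eig > 1)                               # Kaiser criterion
which(cumsum(eig) / sum(eig) >= 0.9)[1]    # components needed for 90% of variance
```

Running the same two lines on nupca should reproduce the 8 and 17 quoted above.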
In any case, let’s take a look at the first three components
nupca$rotation[,1:3]
## PC1 PC2 PC3
## calories -0.28984450 0.071740483 -0.402817933
## protein -0.25221736 0.304280692 0.035072126
## carbohydrates -0.13433393 -0.358147046 -0.173526769
## total_fat -0.19967152 0.248340853 -0.410578696
## saturated_fat -0.13209879 0.269016073 -0.389903929
## caffiene -0.06873300 -0.137864557 0.010768664
## cholesterol -0.08102362 0.325268544 0.043743444
## fiber -0.21811173 -0.339734033 -0.055905551
## folic_acid -0.03832779 -0.104949921 -0.112380806
## sodium -0.04112324 0.017210793 -0.081622954
## calcium -0.19496728 -0.125007935 -0.085652175
## iron -0.25405163 -0.195183474 -0.002385225
## magnesium -0.31744573 -0.217159458 -0.081421582
## manganese -0.13249482 -0.218888570 -0.018023030
## niacin -0.29089205 0.136000014 0.255038615
## phosphorus -0.23840742 0.004174599 -0.090088264
## potassium -0.24781734 -0.212430668 0.066966641
## riboflavin -0.29358866 0.098618291 0.261987058
## selenium -0.11671311 0.168235553 -0.029249888
## thiamin -0.22754623 -0.008655121 0.070734365
## vitamin_A -0.07072538 -0.069177326 0.251779136
## vitamin_B6 -0.28627061 0.060930755 0.319147647
## vitamin_B12 -0.09624647 0.258476489 0.202930449
## vitamin_C -0.11195416 -0.092153385 0.278167093
## vitamin_D -0.06235645 0.173711766 0.105342221
## zinc -0.16433946 0.154609737 0.003061607
To make it more readable, let’s do some rounding
round(nupca$rotation[,1:3], 2)
## PC1 PC2 PC3
## calories -0.29 0.07 -0.40
## protein -0.25 0.30 0.04
## carbohydrates -0.13 -0.36 -0.17
## total_fat -0.20 0.25 -0.41
## saturated_fat -0.13 0.27 -0.39
## caffiene -0.07 -0.14 0.01
## cholesterol -0.08 0.33 0.04
## fiber -0.22 -0.34 -0.06
## folic_acid -0.04 -0.10 -0.11
## sodium -0.04 0.02 -0.08
## calcium -0.19 -0.13 -0.09
## iron -0.25 -0.20 0.00
## magnesium -0.32 -0.22 -0.08
## manganese -0.13 -0.22 -0.02
## niacin -0.29 0.14 0.26
## phosphorus -0.24 0.00 -0.09
## potassium -0.25 -0.21 0.07
## riboflavin -0.29 0.10 0.26
## selenium -0.12 0.17 -0.03
## thiamin -0.23 -0.01 0.07
## vitamin_A -0.07 -0.07 0.25
## vitamin_B6 -0.29 0.06 0.32
## vitamin_B12 -0.10 0.26 0.20
## vitamin_C -0.11 -0.09 0.28
## vitamin_D -0.06 0.17 0.11
## zinc -0.16 0.15 0.00
Let’s make it even more readable. Since we have lots of values here, it is often easier to view just the variables that surpass some threshold of absolute magnitude. For example…
loadi <- round(nupca$rotation[,1:3], 2)
loadi[abs(loadi)<0.2] <- NA
loadi
## PC1 PC2 PC3
## calories -0.29 NA -0.40
## protein -0.25 0.30 NA
## carbohydrates NA -0.36 NA
## total_fat -0.20 0.25 -0.41
## saturated_fat NA 0.27 -0.39
## caffiene NA NA NA
## cholesterol NA 0.33 NA
## fiber -0.22 -0.34 NA
## folic_acid NA NA NA
## sodium NA NA NA
## calcium NA NA NA
## iron -0.25 -0.20 NA
## magnesium -0.32 -0.22 NA
## manganese NA -0.22 NA
## niacin -0.29 NA 0.26
## phosphorus -0.24 NA NA
## potassium -0.25 -0.21 NA
## riboflavin -0.29 NA 0.26
## selenium NA NA NA
## thiamin -0.23 NA NA
## vitamin_A NA NA 0.25
## vitamin_B6 -0.29 NA 0.32
## vitamin_B12 NA 0.26 0.20
## vitamin_C NA NA 0.28
## vitamin_D NA NA NA
## zinc NA NA NA
Let’s start with the first component. It has modest loadings on nearly all the variables, and interestingly, all of the loadings (including those below the threshold) are negative. That is, every variable in the data set loads in the same direction. Since the sign of a component is arbitrary, flipping it doesn’t change the underlying mathematics, so let’s multiply the first component by negative 1.
loadi[,1] <- -loadi[,1]
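To convince yourself the flip is harmless: the scores are just the centred and scaled data multiplied by the rotation matrix, so negating a loading column merely negates the corresponding score column. A quick check on a built-in data set:

```r
p <- prcomp(USArrests, scale. = TRUE)
X <- scale(USArrests)

# scores are exactly the scaled data times the rotation matrix...
max(abs(X %*% p$rotation - p$x))                # essentially zero

# ...so negating a loading column just negates that score column
flipped <- as.numeric(X %*% (-p$rotation[, 1]))
max(abs(flipped + p$x[, 1]))                    # also essentially zero
```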
Now we can reasonably interpret this component as something like “nutrient density”. We would expect any food item that scores high to be nutrient rich, and anything scoring low to be relatively devoid of nutrients (like water). Let’s see what the max and min food items are on this…and keep in mind that we still need to multiply our scores by -1, since they were computed on the original loadings…
head(nutrition[order(-nupca$x[,1], decreasing=TRUE),1])
## [1] "Fruit-flavored drink, powder, with high vitamin C with other added vitamins, low calorie"
## [2] "Leavening agents, yeast, baker's, active dry"
## [3] "Malted drink mix, chocolate, with added nutrients, powder"
## [4] "Rice bran, crude"
## [5] "Spices, basil, dried"
## [6] "Malted drink mix, natural, with added nutrients, powder"
tail(nutrition[order(-nupca$x[,1], decreasing=TRUE),1])
## [1] "Carbonated beverage, low calorie, cola or pepper-type, with aspartame, without caffeine"
## [2] "Tea, herb, chamomile, brewed"
## [3] "Tea, herb, other than chamomile, brewed"
## [4] "Carbonated beverage, low calorie, other than cola or pepper, with aspartame, contains caffeine"
## [5] "Carbonated beverage, club soda"
## [6] "Carbonated beverage, low calorie, other than cola or pepper, without caffeine"
At first blush this appears a bit unintuitive, as most of the entries on both lists are drinks. BUT READ CLOSELY! The first, third, and sixth most nutrient “dense” items are actually POWDERED drink mixes. So 100g of those probably makes a substantial amount of drink, and they are likely packed with sugar (hence very high in calories); in some cases the description even mentions added vitamins. In fact, all of the top 6 are concentrated/dried items. The bottom six, on the other hand, are liquid drinks specifically devoid of most nutrients…diet pop, sparkling water, and herbal tea.
Let’s move on to the second component…
loadi
## PC1 PC2 PC3
## calories 0.29 NA -0.40
## protein 0.25 0.30 NA
## carbohydrates NA -0.36 NA
## total_fat 0.20 0.25 -0.41
## saturated_fat NA 0.27 -0.39
## caffiene NA NA NA
## cholesterol NA 0.33 NA
## fiber 0.22 -0.34 NA
## folic_acid NA NA NA
## sodium NA NA NA
## calcium NA NA NA
## iron 0.25 -0.20 NA
## magnesium 0.32 -0.22 NA
## manganese NA -0.22 NA
## niacin 0.29 NA 0.26
## phosphorus 0.24 NA NA
## potassium 0.25 -0.21 NA
## riboflavin 0.29 NA 0.26
## selenium NA NA NA
## thiamin 0.23 NA NA
## vitamin_A NA NA 0.25
## vitamin_B6 0.29 NA 0.32
## vitamin_B12 NA 0.26 0.20
## vitamin_C NA NA 0.28
## vitamin_D NA NA NA
## zinc NA NA NA
Any item scoring high on PC2 will have high protein/fat/cholesterol/vitamin B12, and low carbs/fiber/iron/magnesium/manganese/potassium. For the most part, it seems like a measure of “meatiness”. Let’s look at the top 6 and bottom 6 on this component…
head(nutrition[order(nupca$x[,2], decreasing=TRUE),1])
## [1] "Egg, yolk, dried"
## [2] "Beef, variety meats and by-products, brain, cooked, simmered"
## [3] "Beef, variety meats and by-products, liver, cooked, pan-fried"
## [4] "Egg, whole, dried"
## [5] "Beef, variety meats and by-products, liver, cooked, braised"
## [6] "Nuts, brazilnuts, dried, unblanched"
tail(nutrition[order(nupca$x[,2], decreasing=TRUE),1])
## [1] "Spices, cloves, ground"
## [2] "Spices, marjoram, dried"
## [3] "Spices, thyme, dried"
## [4] "Spices, basil, dried"
## [5] "Tea, instant, unsweetened, powder, decaffeinated"
## [6] "Tea, instant, unsweetened, powder"
For the most part our interpretation was correct (though I forgot how protein-rich and carb-less nuts were). At least one can’t get much further away from meat than unsweetened tea powder!
Finally, let’s look at component 3
loadi
## PC1 PC2 PC3
## calories 0.29 NA -0.40
## protein 0.25 0.30 NA
## carbohydrates NA -0.36 NA
## total_fat 0.20 0.25 -0.41
## saturated_fat NA 0.27 -0.39
## caffiene NA NA NA
## cholesterol NA 0.33 NA
## fiber 0.22 -0.34 NA
## folic_acid NA NA NA
## sodium NA NA NA
## calcium NA NA NA
## iron 0.25 -0.20 NA
## magnesium 0.32 -0.22 NA
## manganese NA -0.22 NA
## niacin 0.29 NA 0.26
## phosphorus 0.24 NA NA
## potassium 0.25 -0.21 NA
## riboflavin 0.29 NA 0.26
## selenium NA NA NA
## thiamin 0.23 NA NA
## vitamin_A NA NA 0.25
## vitamin_B6 0.29 NA 0.32
## vitamin_B12 NA 0.26 0.20
## vitamin_C NA NA 0.28
## vitamin_D NA NA NA
## zinc NA NA NA
Anything scoring high on this component will be low in calories and fat, but high in vitamins. So generally speaking, let’s just call it “vitamin-dense”.
head(nutrition[order(nupca$x[,3], decreasing=TRUE),1])
## [1] "Fruit-flavored drink, powder, with high vitamin C with other added vitamins, low calorie"
## [2] "Peppers, sweet, red, freeze-dried"
## [3] "Beef, variety meats and by-products, liver, cooked, pan-fried"
## [4] "Beef, variety meats and by-products, liver, cooked, braised"
## [5] "Malted drink mix, chocolate, with added nutrients, powder"
## [6] "Malted drink mix, natural, with added nutrients, powder"
tail(nutrition[order(nupca$x[,3], decreasing=TRUE),1])
## [1] "Pork, fresh, backfat, raw"
## [2] "Butter, whipped, with salt"
## [3] "Butter, without salt"
## [4] "Butter, salted"
## [5] "Nuts, coconut meat, dried (desiccated), not sweetened"
## [6] "Butter oil, anhydrous"
The bottom six are no-brainers (basically just fat-based items), but the top six are again a bit strange. Once more we have our powdered drink mixes, freeze-dried peppers, and beef liver!
Let’s continue with the nutrition data, keeping the same lazy cleaning as before just to keep things simple.
#load("~/Downloads/nutrition.rdata")
#dim(nutrition)
#nutrition <- na.omit(nutrition)
colnames(nutrition)
## [1] "food" "calories" "protein" "carbohydrates"
## [5] "total_fat" "saturated_fat" "caffiene" "cholesterol"
## [9] "fiber" "folic_acid" "sodium" "calcium"
## [13] "iron" "magnesium" "manganese" "niacin"
## [17] "phosphorus" "potassium" "riboflavin" "selenium"
## [21] "thiamin" "vitamin_A" "vitamin_B6" "vitamin_B12"
## [25] "vitamin_C" "vitamin_D" "zinc" "group"
Now then, to set this up in a PCRegression-type context, we’ll need a natural continuous response variable. Calories is likely the most natural choice, since calories can come primarily from both fat and carbs. As such, we will need to remove it from the original PCA call.
nupca <- prcomp(nutrition[,-c(1,2,28)], scale.=TRUE)
summary(nupca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0905 1.6613 1.45197 1.2787 1.24728 1.17387 1.13369
## Proportion of Variance 0.1748 0.1104 0.08433 0.0654 0.06223 0.05512 0.05141
## Cumulative Proportion 0.1748 0.2852 0.36953 0.4349 0.49715 0.55227 0.60368
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.07393 0.98081 0.95224 0.92905 0.85969 0.83157 0.78987
## Proportion of Variance 0.04613 0.03848 0.03627 0.03453 0.02956 0.02766 0.02496
## Cumulative Proportion 0.64982 0.68830 0.72457 0.75909 0.78865 0.81631 0.84127
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.77604 0.76121 0.70758 0.64976 0.62872 0.59025 0.5678
## Proportion of Variance 0.02409 0.02318 0.02003 0.01689 0.01581 0.01394 0.0129
## Cumulative Proportion 0.86536 0.88854 0.90856 0.92545 0.94126 0.95520 0.9681
## PC22 PC23 PC24 PC25
## Standard deviation 0.53484 0.46515 0.40082 0.36685
## Proportion of Variance 0.01144 0.00865 0.00643 0.00538
## Cumulative Proportion 0.97954 0.98819 0.99462 1.00000
Again, the Kaiser criterion would suggest retaining 8 components, even though that retains only approximately 65% of the variability in the data. In the previous lab we investigated the other standard approaches for component retention…let’s not bother with those here. We’ll retain the first 8 components. So, let’s pop the scores of the original observations on those 8 components into an object for future use.
scr <- nupca$x[,1:8]
Now we can run PCReg for calories as a response…
pcregmod <- lm(nutrition$calories ~ scr)
summary(pcregmod)
##
## Call:
## lm(formula = nutrition$calories ~ scr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -624.91 -40.83 -10.22 29.49 236.46
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201.1200 1.5068 133.475 < 2e-16 ***
## scrPC1 35.9805 0.7210 49.907 < 2e-16 ***
## scrPC2 -5.0377 0.9072 -5.553 3.15e-08 ***
## scrPC3 51.3592 1.0380 49.479 < 2e-16 ***
## scrPC4 26.6879 1.1787 22.642 < 2e-16 ***
## scrPC5 55.3065 1.2083 45.770 < 2e-16 ***
## scrPC6 20.3169 1.2839 15.824 < 2e-16 ***
## scrPC7 -28.3220 1.3294 -21.304 < 2e-16 ***
## scrPC8 22.3124 1.4034 15.899 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.42 on 2175 degrees of freedom
## Multiple R-squared: 0.7969, Adjusted R-squared: 0.7962
## F-statistic: 1067 on 8 and 2175 DF, p-value: < 2.2e-16
Cool! Everything appears interesting and useful for modelling. Of course, our PCA was done unsupervised, and therefore without respect to what we were trying to model (calories). So perhaps PLS can help us improve…
We will need the pls library here
#install.packages("pls")
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
Now we will feed the function plsr a formula with calories as the response and all other numeric variables as predictors.
nuplsmod <- plsr(calories~., data=nutrition[,-c(1,28)], scale=TRUE, method="oscorespls")
summary(nuplsmod)
## Data: X dimension: 2184 25
## Y dimension: 2184 1
## Fit method: oscorespls
## Number of components considered: 25
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 14.53 24.29 29.77 34.48 42.06 47.18 51.29
## calories 69.29 89.08 94.64 97.56 98.41 99.10 99.35
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 53.93 58.15 62.05 65.75 68.60 71.63 74.91
## calories 99.48 99.51 99.51 99.51 99.52 99.52 99.52
## 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps
## X 77.47 79.78 82.71 84.67 86.75 89.30 92.12
## calories 99.52 99.52 99.52 99.52 99.52 99.52 99.52
## 22 comps 23 comps 24 comps 25 comps
## X 94.66 96.10 97.89 100.00
## calories 99.52 99.52 99.52 99.52
This gives a very different picture…namely, it appears that 2 components are probably sufficient (cumulatively explaining just over 89% of the variation in calories). Compare that to PCReg, where 8 components explained only about 80% of the variation in calories. Clearly an improvement. In fact, if we take only the first two components for PCReg…
pcregmod2 <- lm(nutrition$calories ~ scr[,1:2])
summary(pcregmod2)
##
## Call:
## lm(formula = nutrition$calories ~ scr[, 1:2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -887.77 -92.51 -45.45 74.04 646.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201.120 2.920 68.888 <2e-16 ***
## scr[, 1:2]PC1 35.981 1.397 25.758 <2e-16 ***
## scr[, 1:2]PC2 -5.038 1.758 -2.866 0.0042 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136.4 on 2181 degrees of freedom
## Multiple R-squared: 0.2355, Adjusted R-squared: 0.2348
## F-statistic: 335.8 on 2 and 2181 DF, p-value: < 2.2e-16
Only 23%! This is why PLS is generally the superior approach in a supervised context.
But how would even PLS compare to a straightforward model built from domain knowledge…
dwmod <- lm(calories~total_fat+carbohydrates, data=nutrition)
summary(dwmod)
##
## Call:
## lm(formula = calories ~ total_fat + carbohydrates, data = nutrition)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151.58 -40.60 -10.91 36.02 310.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.95804 1.47220 33.93 <2e-16 ***
## total_fat 9.63835 0.07792 123.69 <2e-16 ***
## carbohydrates 3.29893 0.03917 84.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.3 on 2181 degrees of freedom
## Multiple R-squared: 0.9081, Adjusted R-squared: 0.908
## F-statistic: 1.078e+04 on 2 and 2181 DF, p-value: < 2.2e-16
Just slightly more of the variation in calories (90.8% versus 89.1%) can be explained using an equivalent dimensionality from the original variables.
In machine learning, there is often no substitute for domain-knowledge!
Let’s simulate a weird data set with 2 clusters and a line of points running in between. We’ll use manhattan distance for no particular reason other than to show how to specify it…
library(MASS)
set.seed(3173)
datagen1 <- mvrnorm(25, c(0,0), matrix(c(1,0,0,1),2,2))
datagen2 <- mvrnorm(25, c(5,-5), matrix(c(1,0,0,1),2,2))
datagen <- rbind(datagen1, datagen2)
x1 <- seq(from=0, to=5, by=0.2)
x2 <- seq(from=0, to=-5, by=-0.2)
newvalx <- cbind(x1,x2)
datagen <- rbind(datagen,newvalx)
plot(datagen)
mandist <- dist(datagen,method="manhattan")
clus1 <- hclust(mandist, method="single")
plot(clus1)
Using single linkage we have major difficulty finding an underlying group structure. The plot above is a tell-tale sign of the phenomenon called “chaining”: essentially, you keep adding observations to one existing group for the majority of the algorithm. This problem tends to be a concern mainly with single linkage…
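Chaining is easy to reproduce in isolation. With equally spaced points on a line, single linkage absorbs the points one at a time, and every merge in the dendrogram happens at the same height:

```r
# 11 equally spaced points on a line: single linkage just chains along them
x <- matrix(seq(0, 10, by = 1))
hc <- hclust(dist(x), method = "single")

hc$height   # every merge happens at height 1 -- the chaining signature
```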
clus2 <- hclust(mandist, method="average")
plot(clus2)
Note that switching to euclidean distance doesn’t fix the chaining issue, but it does result in slightly different groups
eucdist <- dist(datagen,method="euclidean")
clus3 <- hclust(eucdist, method="single")
plot(clus3)
#Manhattan, complete
clus4 <- hclust(mandist, method="complete")
plot(clus4)
Here we specify where to cut each tree by requesting a specific number of groups, and then we can compare the average linkage and complete linkage clusterings (note that, despite the variable names, both were built on manhattan distance)…
mancom <- cutree(clus2,2)
euccom <- cutree(clus4,2)
table(mancom,euccom)
## euccom
## mancom 1 2
## 1 39 0
## 2 0 37
No observations are grouped differently; here’s a plot of the average linkage solution on manhattan distance.
plot(datagen, col=mancom)
Now euclidean with single linkage, which we’ll compare against the complete linkage solution from above.
eucsin <- cutree(clus3,2)
table(euccom,eucsin)
## eucsin
## euccom 1 2
## 1 39 0
## 2 35 2
Single linkage makes two groups, one with only two observations in it!
Let’s code up k-means from scratch. We should be able to provide the function with the data x and the number of groups k.

It’s probably best to remind you of the steps of the k-means algorithm first:

1. Start with k random centroids (let’s use points within the data).
2. Assign all observations to their closest centroid (by euclidean distance).
3. Recalculate group means. Call these your new centroids.
4. Repeat steps 2 and 3 until nothing changes.
There are several ways of going about this. Sometimes it’s easier to code the actual steps first and then back up to set up the loop, etc., but I’ll provide the whole thing, with some comments to point out what we’re accomplishing with each chunk of code.
my_k_means <- function(x, k){
  #1) start with k centroids (random rows of the data)
  centrs <- x[sample(1:nrow(x), k),]
  #start loop
  changing <- TRUE
  while(changing){
    #2a) calculate euclidean distances between all rows of x and the centroids
    dists <- matrix(NA, nrow(x), k)
    #written as a double loop for clarity; apply-style alternatives would also work
    for(i in 1:nrow(x)){
      for(j in 1:k){
        dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2))
      }
    }
    #2b) assign group memberships (which.min over each row of the distance matrix)
    membs <- apply(dists, 1, which.min)
    #3) calculate new group centroids
    oldcentrs <- centrs #save for convergence check!
    for(j in 1:k){
      centrs[j,] <- colMeans(x[membs==j, ])
    }
    #4) check for convergence
    if(all(centrs==oldcentrs)){
      changing <- FALSE
    }
  }
  #output memberships
  membs
}
set.seed(5314)
#debug(my_k_means) #if you want
test <- my_k_means(scale(faithful), 2)
plot(faithful, col=test)
Compare to the built-in version…
test2 <- kmeans(scale(faithful), 2)
plot(faithful, col=test2$cl)
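One caveat when eyeballing these plots: cluster labels are arbitrary, so the two colourings may be swapped even when the partitions agree. In the two-group case, a quick sketch of checking agreement up to relabelling:

```r
set.seed(5314)
a <- kmeans(scale(faithful), 2)$cluster
b <- kmeans(scale(faithful), 2, nstart = 10)$cluster

# agreement up to label switching (valid only for the 2-group case)
max(sum(a == b), sum(a != b)) / length(a)
```

For more than two groups, a cross-tabulation with table() is the easier check.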
We’ll talk more about this in the next week or so, but here’s a very simple way to see some of the hidden assumptions of k-means in action.
library(mvtnorm)
set.seed(35151)
le <- rmvnorm(400, mean = c(-5,7.5))
re <- rmvnorm(400, mean = c(5,7.5))
hd <- rmvnorm(400, mean = c(0,0), sigma=7*diag(2) )
dat <- rbind(le, re, hd)
plot(dat)
mickres <- my_k_means(scale(dat), 3)
plot(dat, col=mickres)
Note how k-means inflates the smaller mouse ears, even though the groups are relatively well separated. We can see this with a side-by-side of the truth and what k-means just gave us
par(mfrow=c(1,2))
plot(dat, col=mickres)
plot(dat, col=rep(c(1,3,2), each=400))
What about hierarchical?
mickdist <- dist(scale(dat))
mickhc <- hclust(mickdist, method="average")
mickhcres <- cutree(mickhc, 3)
plot(dat, col=mickhcres)
Still not perfect, but not as bad. Some of the other linkage options give worse results than k-means as well.
Now that we have some performance metrics and a few models under our belt, let’s pit some classification models against each other on the body data set.
library(gclus)
## Loading required package: cluster
data(body)
body$Gender <- factor(body$Gender)
First up, random forests!
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(4521)
bodyRF <- randomForest(Gender~., data=body)
bodyRF
##
## Call:
## randomForest(formula = Gender ~ ., data = body)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 3.75%
## Confusion matrix:
## 0 1 class.error
## 0 250 10 0.03846154
## 1 9 238 0.03643725
We can see several measures computed from the OOB observations in the output. The misclassification rate is 0.0375. Now, given the lack of information in the printout (and the help file), we may want to verify whether the rows or the columns of the confusion matrix are the ground truth! Easy…
table(body$Gender)
##
## 0 1
## 260 247
So the response variable has 247 ones (males). From the printed results, then, we can deduce that the rows are the truth and the columns are the predictions. We can use the MLmetrics library to compute some of the measures we’ve discussed.
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
Note that the LogLoss function is specifically for binary classification…there is also MultiLogLoss from the same library for multiclass scenarios. Judging from the fairly sparse help files, my best guess for the binary version is that we feed it the predicted probability of class 1 as the prediction vector…
LogLoss(predict(bodyRF, type="prob")[,2], body$Gender)
## Warning in Ops.factor(y_true, log(y_pred)): '*' not meaningful for factors
## Warning in Ops.factor(1, y_true): '-' not meaningful for factors
## [1] NA
Here’s an aggravation. It looks like they want the true 0-1 classification vector as a numeric for computation purposes. Here’s a quick hack…
LogLoss(predict(bodyRF, type="prob")[,2], as.numeric(body$Gender)-1)
## [1] 0.1141382
It’s worth pointing out that all of these measures are believable long-run estimates for those metrics since they are computed on the OOB classifications/probabilities.
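Under the hood, binary log loss is simple enough to hand-roll in base R. MLmetrics presumably computes something equivalent to this sketch (including clipping probabilities away from exactly 0 and 1 so the log stays finite):

```r
# hand-rolled binary log loss: y is the 0/1 truth, p is the predicted P(y = 1)
log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip away from 0 and 1 to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

log_loss(c(0.9, 0.1), c(1, 0))  # equals -log(0.9), about 0.105
```

Feeding it the same OOB probabilities and numeric 0-1 truth vector as above should reproduce the LogLoss value we just computed.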
Now for LDA. Note the CV=TRUE argument: the help file suggests that this will provide classifications and posterior probabilities corresponding to LOOCV.
library(MASS)
bodylda <- lda(Gender~., data=body, CV=TRUE)
table(body$Gender, bodylda$class)
##
## 0 1
## 0 257 3
## 1 6 241
So a misclassification rate of 9/507 ≈ 0.018.
Now for logloss…
LogLoss(bodylda$posterior[,2], as.numeric(body$Gender)-1)
## [1] 0.06311807
Comparing back to random forests, we see an improvement, providing an argument that LDA is the better long-run option (since these are cross-validated results).
Since the sample sizes are large enough, it’s natural to relax the equal covariance matrix assumption of LDA in favour of QDA. Let’s see what happens!
bodyqda <- qda(Gender~., data=body, CV=TRUE)
table(body$Gender, bodyqda$class)
##
## 0 1
## 0 256 4
## 1 6 241
Well, that’s not promising. One more misclassification!
LogLoss(bodyqda$posterior[,2], as.numeric(body$Gender)-1)
## [1] 0.1020329
The LogLoss is even more damning, as it approaches the LogLoss of the random forest model (which had considerably more misclassifications). This suggests that QDA holds much more false confidence in its misclassified observations.
How about k nearest neighbours? Let’s just try for k=1 through k=300…
library(class)
knnruns <- list()
for(i in 1:300){
knnruns[[i]] <- knn.cv(body[,-25], body$Gender, k=i, prob=TRUE)
}
misclass <- unlist(lapply(knnruns, function(v) 507-sum(diag(table(body$Gender, v)))))
plot(misclass)
which.min(misclass)
## [1] 3
These CV runs suggest that k=3 is best for long-run misclassifications. Now we can take the results from k=3 and check all of our other metrics…
knnbest <- knnruns[[3]]
table(body$Gender, knnbest)
## knnbest
## 0 1
## 0 253 7
## 1 4 243
The probabilities returned from KNN are given only for the predicted class. This is fine for computing LogLoss in binary classification…it just means a little manual work. Basically, we sum -log(probs) of the returned probabilities for all correctly classified observations, add -log(1-probs) for the 11 misclassifications we see in the table, and then divide by n…
probs <- attr(knnbest, "prob")
(sum(-log(probs[body$Gender==knnbest])) + sum(-log(1-probs[body$Gender!=knnbest])))/507
## [1] Inf
Shoot…an Inf. This means we were 100% confident in at least one incorrect classification. It also means we need a little more work to get a reasonable LogLoss estimate. We will replace any zero probabilities with 1e-15…
missedprobs <- 1-probs[body$Gender!=knnbest]
missedprobs[missedprobs==0] <- 1e-15
probs[probs==0] <- 1e-15
(sum(-log(probs[body$Gender==knnbest])) + sum(-log(missedprobs)))/507
## [1] 0.3052576
Well that looks pretty bad! But wait, we replaced 0’s with an extremely small value (1e-15), so note that
-log(1e-15)
## [1] 34.53878
is likely ballooning the average substantially. How much of an effect? Well, let’s change to 1e-3…
probs <- attr(knnbest, "prob")
missedprobs <- 1-probs[body$Gender!=knnbest]
missedprobs[missedprobs==0] <- 1e-3
probs[probs==0] <- 1e-3
(sum(-log(probs[body$Gender==knnbest])) + sum(-log(missedprobs)))/507
## [1] 0.08726142
Which is much more in line with some of the other models. Though perhaps models should be more heavily penalized when they say with certainty that there is NO CHANCE that an observation belongs to the CORRECT GROUP!
In any case, we can see that even though KNN’s misclassification count (11) is close to LDA’s (9), KNN is overconfident in those incorrect classifications (and also perhaps underconfident in some of its correct choices).
All signs point to LDA as the best model on this data set. If you take a look at some scatterplots, it does seem that normality could possibly hold (approximately) for a lot of the predictor variables, and so in this sense it’s probably not all that surprising that LDA works very well.
Furthermore, the assumption of equal covariance matrices for physical measurements on both men and women is not really an outrageous assumption to make. An example of the ramifications of this assumption: on average, any increase in height will lead to an equivalent increase in weight, regardless of reported gender (even if there are different mean vectors across the reported genders).
Keeping in mind that mixture model families can often take some time to fit, we will restrict the number of models considered during many of the fits, for brevity of the lab. I certainly encourage you to explore the software in more depth.
First, let’s run the popular mclust package on the
faithful data set
#install.packages("mclust")
#install.packages("teigen")
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:mvtnorm':
##
## dmvnorm
mfaith <- Mclust(faithful, G=1:5)
Rather than scrolling through a large table of values, you can view a plot of the BIC values. The x-axis is the number of groups in the clustering solution (called components of the mixture model), and the y-axis is the BIC (a model fit criterion). Each of the covariance structures we discussed in class gets its own line on the plot.
plot(mfaith, what="BIC")
The classification plot will provide a visual of each mixture component overlaid on top of the fitted groups.
plot(mfaith, what="classification")
The uncertainty plot gives you a visual of which observations have relatively unclear classification. The larger a point appears, the more uncertain we are in its classification.
plot(mfaith, what="uncertainty")
Finally, the density plot will provide the contours of the actual fitted mixture model.
plot(mfaith, what="density")
Other software is available for similar models under different distributional assumptions. Here’s an example using some of Jeff’s software, which uses the multivariate t distribution and eigendecomposition constraints on the covariance matrix.
library(teigen)
tfaith <- teigen(faithful, Gs=1:5, model="UUUU", scale=FALSE, verbose = FALSE)
plot(tfaith, what="uncertainty")
plot(tfaith, what="contour")
The faithful data set is a true unsupervised problem as we don’t have known labels. But we can take a look at some of our supervised data sets to investigate how these clustering methods perform with respect to known groups…
library(gclus)
data(wine)
twine <- teigen(wine[,-1], Gs=1:5, model="UCCU", scale=FALSE, verbose=FALSE)
plot(twine, xmarg=1, ymarg=7, what="contour")
plot(twine, xmarg=1, ymarg=10, what="uncertainty")
table(wine[,1], twine$classification)
##
## 1 2 3
## 1 0 0 59
## 2 66 5 0
## 3 0 48 0
plot(twine$allbic, type="l")
points(twine$allbic)
And so we can see (for at least one covariance structure from tEIGEN) that the G=3 model is chosen on the wine data via the BIC, and the resulting clusters match the known varietals in the data quite closely (5 misclassifications).
We can check the ARI using the function from mclust…
adjustedRandIndex(wine[,1], twine$classification)
## [1] 0.9188052
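That function is just a closed-form computation on the contingency table. For the curious, here’s a base-R sketch of the usual Hubert and Arabie formula (which is what adjustedRandIndex computes):

```r
# adjusted Rand index from the contingency table (Hubert & Arabie formula)
ari <- function(a, b) {
  tab <- table(a, b)
  n   <- sum(tab)
  sij <- sum(choose(tab, 2))           # agreeing pairs within each cell
  sa  <- sum(choose(rowSums(tab), 2))
  sb  <- sum(choose(colSums(tab), 2))
  exp_index <- sa * sb / choose(n, 2)  # expected index under independence
  (sij - exp_index) / ((sa + sb) / 2 - exp_index)
}

ari(c(1, 1, 2, 2), c(2, 2, 1, 1))  # identical partitions up to relabelling: 1
```

Note that it is invariant to label switching, which is exactly why it’s so handy for comparing clusterings to known groups.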